Abstract

In most classical frameworks for learning from examples, it is assumed that examples are randomly drawn 
and presented to the learner. In this paper, we consider the possibility of a more active learner who is 
allowed to choose his/her own examples. Our investigations are carried out in a function approximation 
setting. In particular, using arguments from optimal recovery (Micchelli and Rivlin, 1976), we develop an 
adaptive sampling strategy (equivalent to adaptive approximation) for arbitrary approximation schemes. 
We provide a general formulation of the problem and show how it can be regarded as sequential optimal 
recovery. We demonstrate the application of this general formulation to two special cases of functions 
on the real line: 1) monotonically increasing functions and 2) functions with bounded derivative. An
extensive investigation of the sample complexity of approximating these functions is conducted, yielding
both theoretical and empirical results on test functions. Our theoretical results (stated in PAC style),
along with the simulations, demonstrate the superiority of our active scheme over both passive learning
and classical optimal recovery. The analysis of active function approximation is conducted in a
worst-case setting, in contrast with other Bayesian paradigms obtained from optimal design (Mackay,
1992).
1 Introduction

In the classical paradigm of learning from examples, the data, or examples, are typically drawn according to some fixed, unknown, arbitrary probability distribution. This is the case for a wide class of problems in the PAC (Valiant, 1984) framework, as well as other familiar frameworks in the connectionist (Rumelhart et al., 1986) and pattern recognition literature. In this important sense, the learner is merely a passive recipient of information about the target function. In this paper, we consider the possibility of a more active learner. There are of course myriad ways in which a learner could be more active. Consider, for example, the extreme pathological case where the learner simply asks for the true target function, which is duly provided by an obliging oracle. This, the reader will quickly realize, is hardly interesting. Such pathological cases aside, this theme of activity on the part of the learner has been explored in more principled ways (though it is not always conceived as such) in a number of different settings (PAC-style concept learning, boundary-hunting pattern recognition schemes, adaptive integration, optimal sampling, etc.), and we will comment on these in due course.

For our purposes, we restrict our attention in this paper to the situation where the learner is allowed to choose its own examples 1 for function approximation tasks. In other words, the learner is allowed to decide where in the domain $D$ (for functions defined from $D$ to $Y$) it would like to sample the target function. Note that this is in direct contrast to the passive case, where the learner is presented with randomly drawn examples. Keeping other factors in the learning paradigm unchanged, we then compare the active and passive learners, who differ only in their method of collecting examples. At the outset, we are particularly interested in whether there exist principled ways of collecting examples in the first place. A second important consideration is whether these ways allow the learner to learn with fewer examples. This latter question is particularly important, as one needs to assess the advantage, from an information-theoretic point of view, of active learning.

Are there principled ways to choose examples? We develop a general framework for collecting (choosing) examples for approximating (learning) real-valued functions. This can be viewed as a sequential version of optimal recovery (Micchelli and Rivlin, 1977): a scheme for optimal sampling of functions. Such an active learning scheme is consequently in a worst-case setting, in contrast with other schemes (Mackay, 1992; Cohn, 1993; Sollich, 1993) that operate within a Bayesian, average-case paradigm. We then demonstrate the application of sequential optimal recovery to some specific classes of functions. We obtain theoretical bounds on the sample complexity of the active and passive learners, and perform some empirical simulations to demonstrate the superiority of the active learner.

1 This can be regarded as a computational instantiation of the psychological practice of selective attention, where a human might choose to selectively concentrate on interesting or confusing regions of the feature space in order to better grasp the underlying concept. Consider, for example, the situation when one encounters a speaker with a foreign accent. One cues in to this foreign speech by focusing on and then adapting to its distinguishing properties. This is often accomplished by asking the speaker to repeat words which are confusing to us.

2 A General Framework For Active 
Approximation 

2.1 Preliminaries 

We need to develop the following notions:
$\mathcal{F}$: Let $\mathcal{F}$ denote a class of functions from some domain $D$ to $Y$, where $Y$ is a subset of the real line. The domain $D$ is typically a subset of $R^k$, though it could be more general than that. There is some unknown target function $f \in \mathcal{F}$ which has to be approximated by an approximation scheme.

$\mathcal{D}$: This is a data set obtained by sampling the target $f \in \mathcal{F}$ at a number of points in its domain. Thus,

$$\mathcal{D} = \{(x_i, y_i) \mid x_i \in D,\ y_i = f(x_i),\ i = 1 \ldots n\}$$

Notice that the data is uncorrupted by noise.

$\mathcal{H}$: This is a class of functions (also from $D$ to $Y$) from which the learner will choose one in an attempt to approximate the target $f$. Notationally, we will use $\mathcal{H}$ to refer not merely to the class of functions (hypothesis class) but also to the algorithm by means of which the learner picks an approximating function $h \in \mathcal{H}$ on the basis of the data set $\mathcal{D}$. In other words, $\mathcal{H}$ denotes an approximation scheme which is really a tuple $\langle \mathcal{H}, A \rangle$. $A$ is an algorithm that takes as its input the data set $\mathcal{D}$ and outputs an $h \in \mathcal{H}$.

Examples: If we consider real-valued functions from $R^k$ to $R$, some typical examples of $\mathcal{H}$ are the class of polynomials of a fixed order (say $q$), splines of some fixed order, radial basis functions with some bound on the number of nodes, etc. As a concrete example, consider functions from $[0, 1]$ to $R$. Imagine a data set is collected which consists of examples, i.e., $(x_i, y_i)$ pairs as per our notation. Without loss of generality, one could assume that $x_i < x_{i+1}$ for each $i$. Then a cubic (degree-3) spline is obtained by interpolating the data points by polynomial pieces (with the pieces tied together at the data points or "knots") such that the overall function is twice-differentiable at the knots. Fig. 1 shows an example of an arbitrary data set fitted by cubic splines.

Figure 1: An arbitrary data set fitted with cubic splines

$d_C$: We need a metric to determine how good the learner's approximation is. Specifically, the metric $d_C$ measures the approximation error on the region $C$ of the domain $D$. In other words, $d_C$ takes as its input any two functions (say $f_1$ and $f_2$) from $D$ to $R$ and outputs a real number. It is assumed that $d_C$ satisfies all the requisites for being a real distance metric on the appropriate space of functions. Since the approximation error on a larger domain is going to be at least as great as that on a smaller domain, we can make the following two observations: 1) for any two sets $C_1$ and $C_2$ such that $C_1 \subset C_2$, $d_{C_1}(f_1, f_2) \le d_{C_2}(f_1, f_2)$; 2) $d_D(f_1, f_2)$ is the total approximation error on the entire domain; this is our basic criterion for judging the "goodness" of the learner's hypothesis.

Examples: For real-valued functions from $R^k$ to $R$, the $L_p^C$ metric defined as $d_C(f_1, f_2) = \left(\int_C |f_1 - f_2|^p \, dx\right)^{1/p}$ serves as a natural example of an error metric.
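As a rough illustration of the $L_p$ metric restricted to a region, the following sketch approximates $d_C(f_1, f_2)$ for an interval $C = [a, b]$ of the real line using the midpoint rule. The function name `lp_distance`, the number of quadrature cells `n`, and the two test functions are assumptions made for this example only; they do not come from the paper.

```python
def lp_distance(f1, f2, a, b, p=2.0, n=10_000):
    """Approximate (integral over [a, b] of |f1 - f2|^p dx)^(1/p).

    Uses the midpoint rule with n equal cells (an illustrative choice).
    """
    if b < a:
        raise ValueError("need a <= b")
    width = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * width          # midpoint of the k-th cell
        total += abs(f1(x) - f2(x)) ** p
    return (total * width) ** (1.0 / p)


# Two hypothetical functions on [0, 1].
f = lambda x: x
g = lambda x: 0.0

d_sub = lp_distance(f, g, 0.0, 0.5, p=1)   # region C1 = [0, 0.5]
d_all = lp_distance(f, g, 0.0, 1.0, p=1)   # region C2 = [0, 1], contains C1
```

Because $C_1 \subset C_2$ here, the value on the smaller region cannot exceed the value on the larger one, in line with the first observation above.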



$\mathcal{C}$: This is a collection of subsets $C$ of the domain. We are assuming that the points in the domain where the function is sampled divide (partition) the domain into a collection of disjoint sets $C_i \in \mathcal{C}$ such that $\cup_{i=1}^{n} C_i = D$.
Examples: For the case of functions from $[0, 1]$ to $R$, and a data set $\mathcal{D}$, a natural way in which to partition the domain $[0, 1]$ is into the intervals $[x_i, x_{i+1})$ (here again, without loss of generality, we have assumed that $x_i < x_{i+1}$). The set $\mathcal{C}$ could be the set of all (closed, open, or half-open and half-closed) intervals $[a, b] \subset [0, 1]$.

The goal of the learner (operating with an approximation scheme $\mathcal{H}$) is to provide a hypothesis $h \in \mathcal{H}$ (which it chooses on the basis of its example set $\mathcal{D}$) as an approximator of the unknown target function $f \in \mathcal{F}$. We now need to formally lay down a criterion for assessing the competence of a learner (approximation scheme). In recent times, there has been much use of PAC-like (Valiant, 1984) criteria to assess learning algorithms. Such a criterion has been used largely for concept learning, but some extensions to the case of real-valued functions exist (Haussler, 1989). We adapt here for our purposes a PAC-like criterion to judge the efficacy of approximation schemes of the kind described earlier.

Definition 1 An approximation scheme is said to P-PAC learn the function $f \in \mathcal{F}$ if for every $\epsilon > 0$ and $1 > \delta > 0$, and for an arbitrary distribution $P$ on $D$, it collects a data set $\mathcal{D}$ and computes a hypothesis $h \in \mathcal{H}$ such that $d_D(h, f) < \epsilon$ with probability greater than $1 - \delta$. The function class $\mathcal{F}$ is P-PAC learnable if the approximation scheme can P-PAC learn every function in $\mathcal{F}$. The class $\mathcal{F}$ is PAC learnable if the approximation scheme can P-PAC learn the class for every distribution $P$.

There is an important clarification to be made about our definition above. Note that the distance metric $d$ is arbitrary. It need not be naturally related to the distribution $P$ according to which the data is drawn. Recall that this is not so for the distance metrics typically used in classical PAC formulations. For example, in concept learning, where the set $\mathcal{F}$ consists of indicator functions, the metric used is the $L_1(P)$ metric given by $d(1_A, 1_B) = \int_D |1_A - 1_B| P(x) \, dx$. Similarly, extensions to real-valued functions typically use an $L_2(P)$ metric. The use of such metrics implies that the training error is an empirical average of the true underlying error. One can then make use of convergence of empirical means to true means (Vapnik, 1982) and prove learnability. In our setting, this is not necessarily so. For example, one could always come up with a distribution $P$ which would never allow a passive learner to see examples in a certain region of the domain. However, the arbitrary metric $d$ might weigh this region heavily. Thus the learner would never be able to learn such a function class for this metric. In this sense, our model is more demanding than classical PAC. To make matters easy, we will consider here the case of P-PAC learnability alone, where $P$ is a known distribution (uniform in the example cases studied). However, there is a sense in which our notion of PAC is easier: the learner knows the true metric $d$ and, given any two functions, can compute their relative distance. This is not so in classical PAC, where the learner cannot compute the distance between two functions since it does not know the underlying distribution.

We have left the mechanism of data collection undefined. Our goal here is the investigation of different methods of data collection. A baseline against which we will compare all such schemes is the passive method of data collection, where the learner collects its data set by sampling $D$ according to $P$ and receiving the point $(x, f(x))$. If the learner were allowed to draw its own examples, are there principled ways in which it could do this? Further, as a consequence of this flexibility accorded to the learner in its data gathering scheme, could it learn the class $\mathcal{F}$ with fewer examples? These are the questions we attempt to resolve in this paper, and we begin by motivating and deriving, in the next section, a general framework for active selection of data for arbitrary approximation schemes.

2.2 The Problem of Collecting Examples 

We have introduced, in the earlier section, our baseline algorithm for collecting examples. This corresponds to a passive learner that draws examples according to the probability distribution $P$ on the domain $D$. If such a passive learner collects examples and produces an output $h$ such that $d_D(h, f)$ is less than $\epsilon$ with probability greater than $1 - \delta$, it P-PAC learns the function. The number of examples that a learner needs before it produces such an ($\epsilon$-good, $\delta$-confidence) hypothesis is called its sample complexity.

Against this baseline passive data collection scheme lies the possibility of allowing the learner to choose its own examples. At the outset, it might seem reasonable to believe that a data set would provide the learner with some information about the target function; in particular, it would probably inform it about the "interesting" regions of the function, or regions where the approximation error is high and which need further sampling. On the basis of this kind of information (along with other information about the class of functions in general), one might be able to decide where to sample next. We formalize this notion as follows:

Let $\mathcal{D} = \{(x_i, y_i);\ i = 1 \ldots n\}$ be a data set (containing $n$ data points) which the learner has access to. The approximation scheme acts upon this data set and picks an $h \in \mathcal{H}$ (which best fits the data according to the specifics of the algorithm $A$ inherent in the approximation scheme). Further, let $C_i$; $i = 1, \ldots, K(n)$ 2 be a partition of the domain $D$ into different regions on the basis of this data set. Finally, let

$$\mathcal{F}_{\mathcal{D}} = \{f \in \mathcal{F} \mid f(x_i) = y_i\ \forall (x_i, y_i) \in \mathcal{D}\}$$

This is the set of all functions in $\mathcal{F}$ which are consistent with the data seen so far. The target function could be any one of the functions in $\mathcal{F}_{\mathcal{D}}$.

We first define an error criterion $e_C$ (where $C$ is any subset of the domain) as follows:

$$e_C(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}}} d_C(h, f)$$

Essentially, $e_C$ is a measure of the maximum possible error the approximation scheme could have (over the region $C$) given the data it has seen so far. It clearly depends on the data, the approximation scheme, and the class of functions being learned. It does not depend upon the target function (except indirectly, in the sense that the data is generated by the target function after all, and this dependence is already captured in the expression). We thus have a scheme to measure uncertainty (maximum possible error) over the different regions of the input space $D$. One possible strategy to select a new point might simply be to sample the function in the region $C_i$ where the error bound is the highest. Let us assume we have a procedure $\mathcal{P}$ to do this. $\mathcal{P}$ could be to sample the region $C$ at the centroid of $C$, or to sample $C$ according to some distribution on it, or any other method one might fancy. This can be described as follows:

Active Algorithm A

1. [Initialize] Collect one example $(x_1, y_1)$ by sampling the domain $D$ once according to procedure $\mathcal{P}$.

2. [Obtain New Partitions] Divide the domain $D$ into regions $C_1, \ldots, C_{K(1)}$ on the basis of this data point.

3. [Compute Uncertainties] Compute $e_{C_i}$ for each $i$.

4. [General Update and Stopping Rule] In general, at the $j$th stage, suppose that our partition of the domain $D$ is into $C_i$, $i = 1 \ldots K(j)$. One can compute $e_{C_i}$ for each $i$ and sample the region with maximum uncertainty (say $C_l$) according to procedure $\mathcal{P}$. This would provide a new data point $(x_{j+1}, y_{j+1})$. The new data point would re-partition the domain $D$ into new regions. At any stage, if the maximum uncertainty over the entire domain, $e_D$, is less than $\epsilon$, stop.

2 The number of regions $K(n)$ into which the domain $D$ is partitioned by $n$ data points depends upon the geometry of $D$ and the partition scheme used. For the real line partitioned into intervals as in our example, $K(n) = n + 1$. For $k$-cubes, one might obtain Voronoi partitions and compute $K(n)$ accordingly.
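The steps of Algorithm A can be sketched for functions on $[0, 1]$ as follows. This is only a minimal illustration under several assumptions that are not part of the general formulation: the target is monotonically increasing with values in $[0, M]$, the procedure $\mathcal{P}$ samples the midpoint of the chosen interval, and the uncertainty of an interval is replaced by the simple upper bound (width)$^{1/p}$ times (rise), which holds because any two consistent monotone functions differ by at most the rise on that interval. The names `box_uncertainty` and `active_algorithm_a` are hypothetical.

```python
import bisect


def box_uncertainty(x_lo, y_lo, x_hi, y_hi, p):
    """Upper bound on the L_p error over [x_lo, x_hi] (stand-in for e_C)."""
    return (x_hi - x_lo) ** (1.0 / p) * (y_hi - y_lo)


def active_algorithm_a(target, M, eps, p=1.0, max_queries=1000):
    # Virtual end points (not queries): values of the class lie in [0, M].
    xs, ys = [0.0, 1.0], [0.0, M]
    queries = []

    def query(x):
        y = target(x)
        i = bisect.bisect_left(xs, x)
        xs.insert(i, x)
        ys.insert(i, y)
        queries.append((x, y))

    query(0.5)  # [Initialize] one example chosen by the procedure P
    while True:
        # [Compute Uncertainties] one value per interval of the partition.
        e = [box_uncertainty(xs[i], ys[i], xs[i + 1], ys[i + 1], p)
             for i in range(len(xs) - 1)]
        # L_p errors on disjoint intervals combine as a p-norm.
        total = sum(v ** p for v in e) ** (1.0 / p)
        if total < eps or len(queries) >= max_queries:
            return queries, total
        # [General Update] sample the interval with maximum uncertainty.
        i = max(range(len(e)), key=e.__getitem__)
        query(0.5 * (xs[i] + xs[i + 1]))


# Hypothetical monotone target on [0, 1] with M = 1.
queries, total = active_algorithm_a(lambda x: x ** 2, M=1.0, eps=0.05)
```

The `max_queries` guard is a practical safeguard added for the sketch; the algorithm as stated only stops when the total uncertainty falls below $\epsilon$.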

The above algorithm is one possible active strategy. However, one can carry the argument a little further and obtain an optimal sampling strategy which would give us a precise location for the next sample point. Imagine for a moment that the learner asks for the value of the function at a point $x \in D$. The value returned obviously belongs to the set

$$\mathcal{F}_{\mathcal{D}}(x) = \{f(x) \mid f \in \mathcal{F}_{\mathcal{D}}\}$$

Assume that the value observed was $y \in \mathcal{F}_{\mathcal{D}}(x)$. In effect, the learner now has one more example, the pair $(x, y)$, which it can add to its data set to obtain a new, larger data set $\mathcal{D}'$ where

$$\mathcal{D}' = \mathcal{D} \cup (x, y)$$

Once again, the approximation scheme $\mathcal{H}$ would map the new data set $\mathcal{D}'$ into a new hypothesis $h'$. One can compute

$$e_C(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \sup_{f \in \mathcal{F}_{\mathcal{D}'}} d_C(h', f)$$

Clearly, $e_D(\mathcal{H}, \mathcal{D}', \mathcal{F})$ now measures the maximum possible error after seeing this new data point. This depends upon $(x, y)$ (in addition to the usual $\mathcal{H}$, $\mathcal{D}$, and $\mathcal{F}$). For a fixed $x$, we don't know the value of $y$ we would observe if we had chosen to sample at that point. Consequently, a natural thing to do at this stage is to again take a worst-case bound, i.e., assume we would get the most unfavorable $y$ and proceed. This would provide the maximum possible error we could make if we had chosen to sample at $x$. This error (over the entire domain) is

$$\sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F})$$

Naturally, we would like to sample the point $x$ for which this maximum error is minimized. Thus, the optimal point to sample by this argument is

$$x_{new} = \arg\min_{x \in D} \sup_{y \in \mathcal{F}_{\mathcal{D}}(x)} e_D(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) \qquad (1)$$

This provides us with a principled strategy to choose our next point. The following optimal active learning algorithm follows:

Active Algorithm B (Optimal)

1. [Initialize] Collect one example $(x_1, y_1)$ by sampling the domain $D$ once according to procedure $\mathcal{P}$. We do this because without any data, the approximation scheme would not be able to produce any hypothesis.

2. [Compute Next Point to Sample] Apply eq. 1 and obtain $x_2$. Sampling the function at this point yields the next data point $(x_2, y_2)$, which is added to the data set.

3. [General Update and Stopping Rule] In general, at the $j$th stage, assume we have in place a data set $\mathcal{D}_j$ (consisting of $j$ data). One can compute $x_{j+1}$ according to eq. 1, and sampling the function here one can obtain a new hypothesis and a new data set $\mathcal{D}_{j+1}$. In general, as in Algorithm A, stop whenever the total error $e_D(\mathcal{H}, \mathcal{D}_k, \mathcal{F})$ is less than $\epsilon$.
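To make the minimax rule of eq. 1 concrete, the sketch below evaluates it by brute force for a small, finite function class defined on a grid of points. Everything specific here is an assumption for illustration: functions are stored as tuples of values on the grid, the approximation scheme simply returns the first consistent function, the metric is the mean absolute difference over the grid, and the example class consists of unit step functions. The names `consistent`, `hypothesis`, `dist`, `worst_case_error`, `next_query` and `active_algorithm_b` are hypothetical.

```python
def consistent(F, data):
    """Functions in F that agree with every observed (index, value) pair."""
    return [f for f in F if all(f[i] == y for i, y in data.items())]


def hypothesis(F, data):
    """Toy approximation scheme: return the first consistent function."""
    return consistent(F, data)[0]


def dist(h, f):
    """Mean absolute difference over the grid (a discrete L_1 metric)."""
    return sum(abs(a - b) for a, b in zip(h, f)) / len(h)


def worst_case_error(F, data):
    """e_D: largest distance from the hypothesis to any consistent function."""
    h = hypothesis(F, data)
    return max(dist(h, f) for f in consistent(F, data))


def next_query(F, data, n_points):
    """Eq. 1 by enumeration: minimize over x the worst case over y."""
    version_space = consistent(F, data)
    best_i, best_val = None, float("inf")
    for i in range(n_points):
        if i in data:
            continue
        possible_y = {f[i] for f in version_space}
        val = max(worst_case_error(F, {**data, i: y}) for y in possible_y)
        if val < best_val:
            best_i, best_val = i, val
    return best_i, best_val


def active_algorithm_b(F, target, eps, first_index=0):
    n_points = len(target)
    data = {first_index: target[first_index]}       # [Initialize]
    while worst_case_error(F, data) >= eps and len(data) < n_points:
        i, _ = next_query(F, data, n_points)        # [Compute Next Point]
        data[i] = target[i]                         # query the target there
    return data


# Hypothetical class: unit step functions on an 8-point grid.
F = [tuple(1 if j >= t else 0 for j in range(8)) for t in range(9)]
target = F[5]
data = active_algorithm_b(F, target, eps=0.1)
```

For this toy class the rule ends up behaving like a bisection of the remaining candidate thresholds, and it stops before every grid point has been queried. The enumeration over all candidate points and all consistent functions also makes visible why the computation of $x_{new}$ is more intensive than what Algorithm A requires.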

By the process of derivation, it should be clear that if we chose to sample at some point other than that obtained by eq. 1, an adversary could provide a $y$ value and a function consistent with all the data provided (including the new data point) that would force the learner to make a larger error than if the learner chose to sample at $x_{new}$. In this sense, algorithm B is optimal. It also differs from algorithm A in that it does not require a partition scheme, or a procedure $\mathcal{P}$ to choose a point in some region. However, the computation of $x_{new}$ inherent in algorithm B is typically more intensive than the computations required by algorithm A. Finally, it is worthwhile to observe that crucial to our formulation is the derivation of the error bound $e_D(\mathcal{H}, \mathcal{D}, \mathcal{F})$. As we have noted earlier, this is a measure of the maximum possible error the approximation scheme $\mathcal{H}$ could be forced to make in approximating functions of $\mathcal{F}$ using the data set $\mathcal{D}$. Now, if one wanted an approximation-scheme-independent bound, this would be obtained by minimizing $e_D$ over all possible schemes, i.e.,

$$\inf_{\mathcal{H}} e_D(\mathcal{H}, \mathcal{D}, \mathcal{F})$$

Any approximation scheme can be forced to make at least as much error as the above expression denotes. Another bound of some interest is obtained by removing the dependence of $e_D$ on the data. Thus, given an approximation scheme $\mathcal{H}$, if data $\mathcal{D}$ is drawn randomly, one could compute

$$P\{e_D(\mathcal{H}, \mathcal{D}, \mathcal{F}) > \epsilon\}$$

or, in an approximation-scheme-independent setting, one computes

$$P\{\inf_{\mathcal{H}} e_D(\mathcal{H}, \mathcal{D}, \mathcal{F}) > \epsilon\}$$

The above expressions would provide us with PAC-like bounds which we will make use of later in this paper.

2.3 In Context 

Having motivated and derived two possible active strategies, it is worthwhile at this stage to comment on the formulation and its place in the context of previous work in a similar vein executed across a number of disciplines.
1) Optimal Recovery: The question of choosing the location of points where the unknown function will be sampled has been studied within the framework of optimal recovery (Micchelli and Rivlin, 1976; Micchelli and Wahba, 1981; Athavale and Wahba, 1979). While work of this nature has strong connections to our formulation, there remains a crucial difference. Sampling schemes motivated by optimal recovery are not adaptive. In other words, given a class of functions $\mathcal{F}$ (from which the target $f$ is selected), optimal sampling chooses the points $x_i \in D$, $i = 1, \ldots, n$ by optimizing over the entire function space $\mathcal{F}$. Once these points are obtained, they remain fixed irrespective of the target (and correspondingly the data set $\mathcal{D}$). Thus, if we wanted to sample the function at $n$ points, and had an approximation scheme $\mathcal{H}$ with which we wished to recover the true target, a typical optimal recovery formulation would involve sampling the function at the points obtained as a result of optimizing the following objective function:

$$\arg\min_{x_1, \ldots, x_n} \sup_{f \in \mathcal{F}} d(f, h(\mathcal{D} = \{(x_i, f(x_i))_{i = 1 \ldots n}\})) \qquad (2)$$

where $h(\mathcal{D} = \{(x_i, f(x_i))_{i = 1 \ldots n}\}) \in \mathcal{H}$ is the learner's hypothesis when the target is $f$ and the function is sampled at the $x_i$'s. Given no knowledge of the target, these points are the optimal ones to sample.

In contrast, our scheme of sampling can be conceived as an iterative application of optimal recovery (one point at a time) by conditioning on the data seen so far. Making this absolutely explicit, we start out by asking for one point using optimal recovery. We obtain this point by

$$\arg\min_{x_1} \sup_{f \in \mathcal{F}} d(f, h(\mathcal{D}_1 = \{(x_1, f(x_1))\}))$$

Having sampled at this point (and obtained $y_1$ from the true target), we can now reduce the class of candidate target functions to $\mathcal{F}_1$, the elements of $\mathcal{F}$ which are consistent with the data seen so far. Now we obtain our second point by

$$\arg\min_{x_2} \sup_{f \in \mathcal{F}_1} d(f, h(\mathcal{D}_2 = \{(x_1, y_1), (x_2, f(x_2))\}))$$

Note that the supremum is taken over a restricted set $\mathcal{F}_1$ the second time. In this fashion, we perform optimal recovery at each stage, reducing the class of functions over which the supremum is performed. It should be made clear that this sequential optimal recovery is not a greedy technique to arrive at the solution of eq. 2. It will give us a different set of points. Further, this set of points will depend upon the target function. In other words, the sampling strategy adapts itself to the unknown target $f$ as it gains more information about that target through the data. We know of no similar sequential sampling scheme in the literature.

While classical optimal recovery has the formulation of eq. 2, imagine a situation where a "teacher" who knows the target function and the learner wishes to communicate to the learner the best set of points to minimize the error made by the learner. Thus, given a function $g$, this best set of points can be obtained by the following optimization:

$$\arg\min_{x_1, \ldots, x_n} d(g, h(\{(x_i, g(x_i))\}_{i = 1 \ldots n})) \qquad (3)$$

Eq. 2 and eq. 3 provide two bounds on the performance of the active learner following the strategy of Algorithm B in the previous section. While eq. 2 chooses optimal points without knowing anything about the target, and eq. 3 chooses optimal points knowing the target completely, the active learner chooses points optimally on the basis of partial information about the target (information provided by the data set).

2) Concept Learning: The PAC learning community (which has traditionally focused on concept learning) typically incorporates activity on the part of the learner by means of queries the learner can make of an oracle. Queries (Angluin, 1988) range from membership queries (is $x$ an element of the target concept $c$?) to statistical queries (Kearns, 1993; where the learner cannot ask for data but can ask for estimates of functionals of the function class) to arbitrary boolean-valued queries (see Kulkarni et al. for an investigation of query complexity). Our form of activity can be considered as a natural adaptation of membership queries to the case of learning real-valued functions in our modified PAC model. It is worthwhile to mention relevant work which touches the contents of this paper at some points. The most significant of these is an investigation of the sample complexity of active versus passive learning conducted by Eisenberg and Rivest (1990) for a simple class of unit step functions. It was found that a binary search algorithm could vastly outperform a passive learner in terms of the number of examples it needed to $(\epsilon, \delta)$ learn the target function. This paper is very much in the spirit of that work, focusing as it does on the sample complexity question. Another interesting direction is the transformation of PAC-learning algorithms from a batch to an online mode. While Littlestone et al. (1991) consider online learning of linear functions, Kimber and Long (1992) consider functions with bounded derivatives, which we examine later in this paper. However, the question of choosing one's data is not addressed at all. Kearns and Schapire (1990) consider the learning of p-concepts (which are essentially equivalent to learning classes of real-valued functions with noise) and address the learning of monotone functions in this context. Again, there is no active component on the part of the learner.
3) Adaptive Integration: The novelty of our formulation lies in its adaptive nature. There are some similarities to work in adaptive numerical integration which are worth mentioning. Roughly speaking, an adaptive integration technique (Berntsen et al., 1991, book???) divides the domain of integration into regions over which the integration is done. Estimates are then obtained of the error on each of these regions. The region with maximum error is subdivided. Though the spirit of such an adaptive approach is close to ours, specific results in the field naturally differ because of differences between the integration problem (and its error bounds) and the approximation problem.

4) Bayesian and other formulations: It should be noted that we have a worst-case formulation (the supremum in our formulation represents the maximum possible error the scheme might have). Alternate Bayesian schemes have been devised (Mackay, 1991; Cohn, 1994) from the perspective of optimal experiment design (Fedorov). Apart from the inherently different philosophical positions of the two schemes, an in-depth treatment of the sample complexity question is not done in that work. We will soon give two examples where we address this sample complexity question closely. In a separate piece of work (Sung and Niyogi, 1994), the author has also investigated such Bayesian formulations from an information-theoretic perspective. Yet another average-case formulation comes from the information-complexity viewpoint of Traub and Wozniakowski (see Traub et al. (1988) for details). Various interesting sampling strategies are suggested by research in that spirit. We do not attempt to compare them due to the difficulty in comparing worst-case and average-case bounds.

Thus, we have motivated and derived in this section two possible active strategies. The formulation is general. We now demonstrate the usefulness of such a formulation by considering two classes of real-valued functions as examples and deriving specific active algorithms from this perspective. At this stage, the important question of the sample complexity of active versus passive learning still remains unresolved. We investigate this more closely by deriving theoretical bounds and performing empirical simulation studies in the case of the specific classes we consider.

3 Example 1: A Class of Monotonically 
Increasing Bounded Functions 

Consider the following class of functions from the interval $[0, 1] \subset R$ to $R$:

$$\mathcal{F} = \{f : 0 \le f \le M, \text{ and } f(x) \ge f(y)\ \forall x \ge y\}$$

Note that the functions belonging to this class need not be continuous, though they do need to be measurable. This class is PAC-learnable (with an $L_1(P)$ norm, in which case our notion of PAC reduces to the classical notion) though it has infinite pseudo-dimension 3 (in the sense of Pollard (1984)). Thus, we observe:

Observation 1 The class $\mathcal{F}$ has infinite pseudo-dimension (in the sense of Pollard (1984); Haussler (1989)).

Proof: To have infinite pseudo-dimension, it must be the case that for every $n > 0$, there exists a set of points $\{x_1, \ldots, x_n\}$ which is shattered by the class $\mathcal{F}$. In other words, there must exist a fixed translation vector $t = (t_1, \ldots, t_n)$ such that for every boolean vector $b = (b_1, \ldots, b_n)$, there exists a function $f \in \mathcal{F}$ which satisfies $f(x_i) - t_i > 0 \Leftrightarrow b_i = 1$. To see that this is indeed the case, let the $n$ points be $x_i = i/(n + 1)$ for $i$ going from 1 to $n$. Let the translation vector then be given by $t_i = x_i$. For an arbitrary boolean vector $b$ we can always come up with a monotonic function such that $f(x_i) = i/(n + 1) - 1/(3(n + 1))$ if $b_i = 0$ and $f(x_i) = i/(n + 1) + 1/(3(n + 1))$ if $b_i = 1$. □
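The construction in the proof can be checked mechanically for small $n$. The sketch below enumerates every boolean vector $b$, builds the prescribed values $f(x_i)$ with exact rational arithmetic, and verifies that they are increasing, lie in $[0, 1]$, and sit above $t_i$ exactly when $b_i = 1$. The function names are hypothetical, and the check assumes $M \ge 1$ so that the values fall inside the allowed range.

```python
from fractions import Fraction
from itertools import product


def shattering_values(b):
    """Points x_i, translations t_i and values f(x_i) for boolean vector b."""
    n = len(b)
    step = Fraction(1, n + 1)
    offset = Fraction(1, 3 * (n + 1))
    xs = [i * step for i in range(1, n + 1)]
    ts = list(xs)                                   # t_i = x_i
    fs = [x + offset if bit else x - offset for x, bit in zip(xs, b)]
    return xs, ts, fs


def check_shattered(n):
    """True if the construction works for every boolean vector of length n."""
    for b in product([0, 1], repeat=n):
        xs, ts, fs = shattering_values(b)
        monotone = all(fs[i] < fs[i + 1] for i in range(n - 1))
        in_range = all(0 <= v <= 1 for v in fs)     # assumes M >= 1
        signs_ok = all((f - t > 0) == bool(bit)
                       for f, t, bit in zip(fs, ts, b))
        if not (monotone and in_range and signs_ok):
            return False
    return True
```

Monotonicity holds because consecutive values are separated by at least $1/(n+1) - 2/(3(n+1)) = 1/(3(n+1)) > 0$, so the values at the $n$ points can always be extended to a monotone function on $[0, 1]$.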

3 Finite pseudo-dimension is only a sufficient and not a necessary condition for PAC learnability, as this example demonstrates.

We also need to specify the terms $\mathcal{H}$, $d_C$, the procedure $\mathcal{P}$ for partitioning the domain $D = [0, 1]$, and so on. For our purposes, we assume that the approximation scheme $\mathcal{H}$ is first order splines. This is simply finding the monotonic function which interpolates the data in a piece-wise linear fashion. A natural way to partition the domain is to divide it into the intervals $[0, x_1)$, $[x_1, x_2)$, $\ldots$, $[x_i, x_{i+1})$, $\ldots$, $[x_n, 1]$. The metric $d_C$ is an $L_p$ metric given by $d_C(f_1, f_2) = \left(\int_C |f_1 - f_2|^p \, dx\right)^{1/p}$.
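A first order spline through monotone data can be sketched as plain piecewise-linear interpolation. The helper below is hypothetical, and its behaviour outside the sampled range (holding the first and last observed values constant) is an assumption of this sketch, since the text does not specify how the hypothesis is extended to the end intervals.

```python
import bisect


def piecewise_linear(xs, ys):
    """Return h that linearly interpolates the points (xs[i], ys[i]).

    xs must be strictly increasing. Outside [xs[0], xs[-1]] the end values
    are held constant (an assumption of this sketch).
    """
    def h(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect.bisect_right(xs, x) - 1      # interval [xs[i], xs[i+1])
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])
    return h


# Hypothetical monotone data set with two samples.
h = piecewise_linear([0.2, 0.6], [1.0, 3.0])
```

If the data values are non-decreasing, the interpolant is non-decreasing as well, so the hypothesis stays inside the class being learned.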

Note that we are specifically interested in comparing 
the sample complexities of passive and active learning. 
We will do this under a uniform distributional assump- 
tion, i.e., the passive learner draws its examples by sam- 
pling the target function uniformly at random on its do- 
main [0,1]. In contrast, we will show how our general 
formulation in the earlier section translates into a spe- 
cific active algorithm for choosing points, and we derive 
bounds on its sample complexity. We begin by first pro- 
viding a lower bound for the number of examples a pas- 
sive PAC learner would need to draw to learn this class 
T. 

3.1 Lower Bound for Passive Learning 

Theorem 1 Any passive learning algorithm (more specifically, any approximation scheme which draws data uniformly at random and interpolates the data by any arbitrary bounded function) will have to draw at least $\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta)$ examples to P-PAC learn the class, where $P$ is a uniform distribution.

Proof: Consider the uniform distribution on $[0, 1]$ and a subclass of functions which have value 0 on the region $A = [0, 1 - (2\epsilon/M)^p]$ and belong to $\mathcal{F}$. Suppose the passive learner draws $l$ examples uniformly at random. Then with probability $(1 - (2\epsilon/M)^p)^l$, all these examples will be drawn from region $A$. It only remains to show that for the subclass considered, whatever the function hypothesized by the learner, an adversary can force it to make a large error.

Suppose the learner hypothesizes that the function is $h$. Let the value of

$$\left(\int_{(1-(2\epsilon/M)^p,\,1)} |h(x)|^p \, dx\right)^{1/p}$$

be $\chi$. Obviously $0 \le \chi \le (M^p (2\epsilon/M)^p)^{1/p} = 2\epsilon$. If $\chi < \epsilon$, then the adversary can claim that the target function was really

$$g(x) = \begin{cases} 0 & \text{for } x \in [0, 1 - (2\epsilon/M)^p] \\ M & \text{for } x \in (1 - (2\epsilon/M)^p, 1] \end{cases}$$

If, on the other hand, $\chi \ge \epsilon$, then the adversary can claim the function was really $g = 0$.

In the first case, by the triangle inequality,

$$d(h,g) = \left(\int_{[0,1]} |g-h|^p \, dx\right)^{1/p} \ge \left(\int_{[1-(2\epsilon/M)^p,\,1]} |g-h|^p \, dx\right)^{1/p}$$

$$\ge \left(\int_{(1-(2\epsilon/M)^p,\,1)} M^p \, dx\right)^{1/p} - \left(\int_{(1-(2\epsilon/M)^p,\,1)} |h|^p \, dx\right)^{1/p} = 2\epsilon - \chi > \epsilon$$

In the second case,

$$d(h,g) = \left(\int_{[0,1]} |g-h|^p \, dx\right)^{1/p} \ge \left(\int_{(1-(2\epsilon/M)^p,\,1)} |0-h|^p \, dx\right)^{1/p} = \chi \ge \epsilon$$

Now we need to find out how large $l$ must be so that this particular event of drawing all examples in $A$ is not very likely, in particular, so that it has probability less than $\delta$.

For $(1 - (2\epsilon/M)^p)^l$ to be greater than $\delta$, we need

$$l < \frac{1}{-\ln(1 - (2\epsilon/M)^p)} \ln\left(\frac{1}{\delta}\right).$$

It is a fact that for $\alpha < 1/2$,

$$\frac{1}{2\alpha} < \frac{1}{-\ln(1-\alpha)}.$$

Making use of this fact (and setting $\alpha = (2\epsilon/M)^p$), we see that for $\epsilon < (M/2)(1/2)^{1/p}$, we have

$$\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta) < \frac{1}{-\ln(1 - (2\epsilon/M)^p)} \ln\left(\frac{1}{\delta}\right).$$

So unless $l$ is greater than $\frac{1}{2}(M/2\epsilon)^p \ln(1/\delta)$, the probability that all examples are chosen from $A$ is greater than $\delta$. Consequently, with probability greater than $\delta$, the passive learner is forced to make an error of at least $\epsilon$, and PAC learning cannot take place. □
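To get a feel for the magnitude of the bound in Theorem 1, the following minimal Python sketch evaluates it for given values of $M$, $\epsilon$, $p$ and $\delta$, and also computes the probability that all draws land in the region $A$. The function names are ours and purely illustrative; they do not appear in the original text.

```python
import math

def passive_lower_bound(M, eps, p, delta):
    """Evaluate (1/2) * (M / (2*eps))**p * ln(1/delta) from Theorem 1."""
    alpha = (2.0 * eps / M) ** p  # length of the region outside A
    if not alpha < 0.5:
        # The proof uses the fact 1/(2a) < -1/ln(1-a), stated for a < 1/2.
        raise ValueError("bound is derived only for (2*eps/M)**p < 1/2")
    return 0.5 * (M / (2.0 * eps)) ** p * math.log(1.0 / delta)

def prob_all_in_A(M, eps, p, n_examples):
    """Probability that n uniform draws on [0, 1] all fall in A."""
    return (1.0 - (2.0 * eps / M) ** p) ** n_examples

bound = passive_lower_bound(M=1.0, eps=0.05, p=1, delta=0.1)
n = math.floor(bound)
print(bound, n, prob_all_in_A(1.0, 0.05, 1, n))
```

With fewer draws than the bound, the probability of seeing nothing outside $A$ stays above $\delta$, which is exactly the event the adversary exploits.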

3.2 Active Learning Algorithms 

In the previous section we computed a lower bound 
for passively PAC learning this class for a uniform 
distribution 4 . Here we derive an active learning strategy 
(the CLA algorithm) which would meaningfully choose 
new examples on the basis of information gathered about 
the target from previous examples. This is a specific in- 
stantiation of the general formulation, and interestingly 
yields a "divide and conquer" binary searching algorithm 
starting from a different philosophical standpoint. We 
formally prove an upper bound on the number of exam- 
ples it requires to PAC learn the class. While this upper 
bound is a worst case bound and holds for all functions 
in the class, the actual number of queries (examples) this 
strategy takes differs widely depending upon the target 
function. We demonstrate empirically the performance 
of this strategy for different kinds of functions in the 
class in order to get a feel for this difference. We de- 
rive a classical non-sequential optimal sampling strategy 
and show that this is equivalent to uniformly sampling 
the target function. Finally, we are able to empirically 
demonstrate that the active algorithm outperforms both 
the passive and uniform methods of data collection. 

3.2.1 Derivation of an optimal sampling strategy

Consider an approximation scheme of the sort described earlier attempting to approximate a target function $f \in \mathcal{F}$ on the basis of a data set $\mathcal{D}$. Shown in fig. 2 is a picture of the situation. We can assume without loss of generality that we start out by knowing the value of the function at the points $x = 0$ and $x = 1$. The points $\{x_i;\ i = 1, \ldots, n\}$ divide the domain into $n + 1$ intervals $C_i$ ($i$ going from 0 to $n$) where $C_i = [x_i, x_{i+1}]$ ($x_0 = 0$, $x_{n+1} = 1$). The monotonicity constraint on $\mathcal{F}$ permits us to obtain rectangular boxes showing the values that the target function could take at the points on its domain. The set of all functions which lie within these boxes as shown is $\mathcal{F}_\mathcal{D}$.

Let us first compute $e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ for some interval $C_i$. On this interval, the function is constrained to lie in the appropriate box. We can zoom in on this box as shown in fig. 3.

The maximum error the approximation scheme could have (indicated by the shaded region) is clearly given by

$$\left(\int_{C_i} |h - f(x_i)|^p \, dx\right)^{1/p} = \left(\int_0^B \left(\frac{A}{B} x\right)^p dx\right)^{1/p} = \frac{A B^{1/p}}{(p+1)^{1/p}}$$

where $A = f(x_{i+1}) - f(x_i)$ and $B = (x_{i+1} - x_i)$.

4 Naturally, this is a distribution-free lower bound as well. In other words, we have demonstrated the existence of a distribution for which the passive learner would have to draw at least as many examples as the theorem suggests.

Figure 2: A depiction of the situation for an arbitrary data set. The set $\mathcal{F}_\mathcal{D}$ consists of all functions lying in the boxes and passing through the datapoints (for example, the dotted lines). The approximating function $h$ is a linear interpolant shown by a solid line.

Figure 3: Zoomed version of interval. The maximum error the approximation scheme could have is indicated by the shaded region. This happens when the adversary claims the target function had the value $y_i$ throughout the interval.

Clearly the error over the entire domain $e_D$ is given by

$$e_D^p = \sum_{i=0}^{n} e_{C_i}^p \qquad (4)$$



The computation of $e_C$ is all we need to implement an active strategy motivated by Algorithm A in section 2. All we need to do is sample the function in the interval with largest error; recall that we need a procedure $\mathcal{P}$ to determine how to sample this interval to obtain a new data point. We choose (arbitrarily) to sample the midpoint of the interval with the largest error, yielding the following algorithm.

The Choose and Learn Algorithm (CLA) 

1. [Initial Step] Ask for values of the function at points $x = 0$ and $x = 1$. At this stage, the domain $[0, 1]$ is composed of one interval only, viz., $[0, 1]$. Compute $E_1 = \frac{1}{(p+1)^{1/p}} (1 - 0)^{1/p} |f(1) - f(0)|$ and $T_1 = E_1$. If $T_1 < \epsilon$, stop and output the linear interpolant of the samples as the hypothesis, otherwise query the midpoint of the interval to get a partition of the domain into two subintervals $[0, 1/2)$ and $[1/2, 1]$.

2. [General Update and Stopping Rule] In general, at the $k$th stage, suppose that our partition of the interval $[0, 1]$ is $[x_0 = 0, x_1), [x_1, x_2), \ldots, [x_{k-1}, x_k = 1]$. We compute the normalized error $E_i = \frac{1}{(p+1)^{1/p}} (x_i - x_{i-1})^{1/p} |f(x_i) - f(x_{i-1})|$ for all $i = 1, \ldots, k$. The midpoint of the interval with maximum $E_i$ is queried for the next sample. The total normalized error $T_k = \left(\sum_{i=1}^{k} E_i^p\right)^{1/p}$ is computed at each stage and the process is terminated when $T_k < \epsilon$. Our hypothesis $h$ at every stage is a linear interpolation of all the points sampled so far and our final hypothesis is obtained upon the termination of the whole process.
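The two steps above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation used for the simulations: the function names, the list-based bookkeeping, and the `max_queries` safety cap are all ours, and the target `f` is assumed to be a monotonically increasing callable on $[0, 1]$.

```python
def interval_errors(xs, ys, p):
    """Normalized error E_i for each interval [x_{i-1}, x_i]."""
    g = (1.0 / (p + 1)) ** (1.0 / p)
    return [g * (xs[i] - xs[i - 1]) ** (1.0 / p) * abs(ys[i] - ys[i - 1])
            for i in range(1, len(xs))]

def total_error(xs, ys, p):
    """Total normalized error T_k = (sum_i E_i^p)^(1/p)."""
    return sum(e ** p for e in interval_errors(xs, ys, p)) ** (1.0 / p)

def cla(f, eps, p=1, max_queries=10_000):
    """Query midpoints of the worst interval until T_k < eps.

    Returns the sampled points; the hypothesis is their linear interpolant.
    max_queries is a safety cap added for this sketch only.
    """
    xs = [0.0, 1.0]
    ys = [f(0.0), f(1.0)]
    while total_error(xs, ys, p) >= eps and len(xs) < max_queries:
        errs = interval_errors(xs, ys, p)
        j = max(range(len(errs)), key=errs.__getitem__)  # interval [xs[j], xs[j+1]]
        mid = 0.5 * (xs[j] + xs[j + 1])
        xs.insert(j + 1, mid)
        ys.insert(j + 1, f(mid))
    return xs, ys

xs, ys = cla(lambda x: x ** 3, eps=0.05, p=1)
print(len(xs), total_error(xs, ys, 1))
```

Recomputing every $E_i$ at each stage keeps the sketch short; since a new sample only changes the interval that was split, a priority queue keyed on $E_i$ would avoid the repeated work.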

Now imagine that we chose to sample at a point $x \in C_i = [x_i, x_{i+1}]$ and received the value $y \in \mathcal{F}_\mathcal{D}(x)$ (i.e., $y$ in the box) as shown in fig. 4. This adds one more interval and divides $C_i$ into two subintervals $C_{i1}$ and $C_{i2}$ where $C_{i1} = [x_i, x]$ and $C_{i2} = [x, x_{i+1}]$. We also correspondingly obtain two smaller boxes inside the larger box within which the function is now constrained to lie. The uncertainty measure $e_C$ can be recomputed taking this into account.

Observation 2 The addition of the new data point $(x, y)$ does not change the uncertainty value on any of the other intervals. It only affects the interval $C_i$ which got subdivided. The total uncertainty over this interval is now given by

$$e_{C_i}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \left(\frac{1}{p+1}\right)^{1/p} \left((x - x_i)(y - f(x_i))^p + (x_{i+1} - x)(f(x_{i+1}) - y)^p\right)^{1/p}$$

$$= G \left(z r^p + (B - z)(A - r)^p\right)^{1/p}$$

where for convenience we have used the substitution $z = x - x_i$, $r = y - f(x_i)$, $G = (1/(p+1))^{1/p}$, and $A$ and $B$ are $f(x_{i+1}) - f(x_i)$ and $x_{i+1} - x_i$ as above. Clearly $z$ ranges from 0 to $B$ while $r$ ranges from 0 to $A$.

Figure 4: The situation when the interval $C_i$ is sampled yielding a new data point. This subdivides the interval into two subintervals and the two shaded boxes indicate the new constraints on the function.

We first prove the following lemma:

Lemma 1

$$B/2 = \arg\min_{z \in [0,B]} \sup_{r \in [0,A]} G \left(z r^p + (B - z)(A - r)^p\right)^{1/p}$$

Proof: Consider any $z \in [0, B]$. There are three cases to consider:

Case I $z > B/2$: let $z = B/2 + \alpha$ where $\alpha > 0$. We find

$$\sup_{r \in [0,A]} G \left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G \left(\sup_{r \in [0,A]} \left(z r^p + (B - z)(A - r)^p\right)\right)^{1/p}$$

Now,

$$\sup_{r \in [0,A]} \left(z r^p + (B - z)(A - r)^p\right) = \sup_{r \in [0,A]} \left((B/2 + \alpha) r^p + (B/2 - \alpha)(A - r)^p\right)$$

$$= \sup_{r \in [0,A]} \left(\frac{B}{2}\left(r^p + (A - r)^p\right) + \alpha\left(r^p - (A - r)^p\right)\right)$$

Now for $r = A$, the expression within the supremum, $\frac{B}{2}(r^p + (A - r)^p) + \alpha(r^p - (A - r)^p)$, is equal to $(B/2 + \alpha) A^p$. For any other $r \in [0, A]$, we need to show that

$$\frac{B}{2}\left(r^p + (A - r)^p\right) + \alpha\left(r^p - (A - r)^p\right) \le (B/2 + \alpha) A^p$$

or, dividing through by $A^p$,

$$\frac{B}{2}\left((r/A)^p + (1 - r/A)^p\right) + \alpha\left((r/A)^p - (1 - r/A)^p\right) \le B/2 + \alpha$$

Putting $\beta = r/A$ (clearly $\beta \in [0, 1]$), and noticing that $(1 - \beta)^p \le 1 - \beta^p$ and $\beta^p - (1 - \beta)^p \le 1$, the inequality above is established. Consequently, we are able to see that

$$\sup_{r \in [0,A]} G \left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G (B/2 + \alpha)^{1/p} A$$

Case II Let $z = B/2 - \alpha$ for $\alpha > 0$. In this case, by a similar argument as above, it is possible to show that again,

$$\sup_{r \in [0,A]} G \left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G (B/2 + \alpha)^{1/p} A$$

Case III Finally, let $z = B/2$. Here

$$\sup_{r \in [0,A]} G \left(z r^p + (B - z)(A - r)^p\right)^{1/p} = G (B/2)^{1/p} \sup_{r \in [0,A]} \left(r^p + (A - r)^p\right)^{1/p}$$

Clearly, then for this case, the above expression is reduced to $G A (B/2)^{1/p}$. Considering the three cases, the lemma is proved. □
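Lemma 1 can also be checked numerically. The sketch below is only an illustration under our own assumptions: it replaces the supremum and the minimum by searches over finite grids, drops the constant $G$ (which does not affect the minimizer), and uses helper names of our own choosing.

```python
def worst_case(z, A, B, p, n_r=201):
    """Grid approximation of sup over r in [0, A] of
    (z r^p + (B - z)(A - r)^p)^(1/p); the constant G is omitted."""
    rs = [A * k / (n_r - 1) for k in range(n_r)]
    return max((z * r ** p + (B - z) * (A - r) ** p) ** (1.0 / p) for r in rs)

def best_split(A, B, p, n_z=201):
    """Grid approximation of the z in [0, B] minimizing the worst case."""
    zs = [B * k / (n_z - 1) for k in range(n_z)]
    return min(zs, key=lambda z: worst_case(z, A, B, p))

A, B, p = 0.3, 0.4, 2
z_star = best_split(A, B, p)
print(z_star, worst_case(z_star, A, B, p), A * (B / 2) ** (1.0 / p))
```

The minimizer found on the grid sits at $B/2$, and the worst case there matches $A (B/2)^{1/p}$, in agreement with Case III of the proof.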

The above lemma in conjunction with eq. 4 and observation 2 proves that if we choose to sample a particular interval $C_i$ then sampling the midpoint is the optimal thing to do. In particular, we see that

$$\min_{x \in [x_i, x_{i+1}]} \sup_{y \in [f(x_i), f(x_{i+1})]} e_{C_i}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \left(\frac{1}{p+1}\right)^{1/p} \left(\frac{x_{i+1} - x_i}{2}\right)^{1/p} (f(x_{i+1}) - f(x_i)) = e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F}) / 2^{1/p}$$

In other words, if the learner were constrained to pick its next sample in the interval $C_i$, then by sampling the midpoint of this interval $C_i$, the learner ensures that the maximum error it could be forced to make by a malicious adversary is minimized. In particular, if the uncertainty over the interval $C_i$ with its current data set $\mathcal{D}$ is $e_{C_i}$, the uncertainty over this region will be reduced after sampling its midpoint and can have a maximum value of $e_{C_i} / 2^{1/p}$.

Now which interval must the learner sample to minimize the maximum possible uncertainty over the entire domain $D = [0, 1]$? Noting that if the learner chose to sample the interval $C_i$ then

$$\min_{x \in C_i} \sup_{y \in \mathcal{F}_\mathcal{D}(x)} e_{D=[0,1]}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \left(\sum_{j=0,\, j \ne i}^{n} e_{C_j}^p(\mathcal{H}, \mathcal{D}, \mathcal{F}) + \frac{e_{C_i}^p(\mathcal{H}, \mathcal{D}, \mathcal{F})}{2}\right)^{1/p}$$

From the decomposition above, it is clear that the optimal point to sample according to the principle embodied in Algorithm B is the midpoint of the interval $C_j$ which has the maximum uncertainty $e_{C_j}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ on the basis of the data seen so far, i.e., the data set $\mathcal{D}$. Thus we can state the following theorem.

Theorem 2 The CLA is the optimal algorithm for the class of monotonic functions.



Having thus established that our binary searching algorithm (CLA) is optimal, we now turn our efforts to determining the number of examples the CLA would need in order to learn the unknown target function to $\epsilon$ accuracy with $\delta$ confidence. In particular, we can prove the following theorem.

Theorem 3 The CLA converges in at most $(M/\epsilon)^p$ steps. Specifically, after collecting at most $(M/\epsilon)^p$ examples, its hypothesis is $\epsilon$ close to the target with probability 1.

Proof Sketch: The proof of convergence for this algorithm is a little tedious. However, to convince the reader, we provide the proof of convergence for a slight variant of the active algorithm. It is possible to show (not shown here) that the convergence time for the active algorithm described earlier is bounded by the convergence time for the variant. First, consider a uniform grid of points $(\epsilon/M)^p$ apart on the domain $[0, 1]$. Now imagine that the active learner works just as described earlier but with a slight twist, viz., it can only query points on this grid. Thus at the $k$th stage, instead of querying the true midpoint of the interval with largest uncertainty, it will query the gridpoint closest to this midpoint. Obviously the intervals at the $k$th stage are also separated by points on the grid (i.e. previous queries). If it is the case that the learner has queried all the points on the grid, then the maximum possible error it could make is at most $\epsilon$. To see this, let $\alpha = (\epsilon/M)^p$ denote the grid spacing and let us first look at a specific small interval $[k\alpha, (k+1)\alpha]$. We know the following to be true for this subinterval:

$$f(k\alpha) = h(k\alpha) \le f(x), h(x) \le f((k+1)\alpha) = h((k+1)\alpha)$$

Thus

$$|f(x) - h(x)| \le f((k+1)\alpha) - f(k\alpha)$$

and so over the interval $[k\alpha, (k+1)\alpha]$

$$\int_{k\alpha}^{(k+1)\alpha} |f(x) - h(x)|^p \, dx \le \int_{k\alpha}^{(k+1)\alpha} \left(f((k+1)\alpha) - f(k\alpha)\right)^p dx \le \left(f((k+1)\alpha) - f(k\alpha)\right)^p \alpha$$

It follows that

$$\int_{[0,1]} |f - h|^p \, dx = \int_{[0,\alpha)} |f - h|^p \, dx + \ldots + \int_{[1-\alpha,1]} |f - h|^p \, dx$$

$$\le \alpha\left((f(\alpha) - f(0))^p + (f(2\alpha) - f(\alpha))^p + \ldots + (f(1) - f(1 - \alpha))^p\right)$$

$$\le \alpha\left(f(\alpha) - f(0) + f(2\alpha) - f(\alpha) + \ldots + f(1) - f(1 - \alpha)\right)^p$$

$$\le \alpha (f(1) - f(0))^p \le \alpha M^p$$

So with $\alpha = (\epsilon/M)^p$, we see that the $L_p$ error would be at most $\left(\int_{[0,1]} |f - h|^p \, dx\right)^{1/p} \le \epsilon$. Thus the active learner moves from stage to stage collecting examples at the grid points. It could converge at any stage, but clearly after it has seen the value of the unknown target at all the gridpoints, its error is provably at most $\epsilon$ and consequently it must stop by this time. □




Figure 5: How the CLA chooses its examples. Vertical 
lines have been drawn to mark the x-coordinates of the 
points at which the algorithm asks for the value of the 
function. 



3.3 Empirical Simulations, and other Investigations

Our aim here is to characterise the performance of CLA 
as an active learning strategy. Remember that CLA is 
an adaptive example choosing strategy and the number 
of samples it would take to converge depends upon the 
specific nature of the target function. We have already 
computed an upper bound on the number of samples it 
would take to converge in the worst case. In this section 
we try to provide some intuition as to how this sampling 
strategy differs from random draw of points (equivalent 
to passive learning) or drawing points on a uniform grid 
(equivalent to optimal recovery following eq. 2 as we shall 
see shortly). We perform simulations on arbitrary mono- 
tonic increasing functions to better characterize condi- 
tions under which the active strategy could outperform 
both a passive learner as well as a uniform learner. 

3.3.1 Distribution of Points Selected 

As has been mentioned earlier, the points selected by 
CLA depend upon the specific target function. Shown in 
fig. 3-5 is the performance of the algorithm for an ar- 
bitrarily constructed monotonically increasing function. 
Notice the manner in which it chooses its examples. In- 
formally speaking, in regions where the function changes 
a lot (such regions can be considered to have high in- 
formation density and consequently more "interesting"), 
CLA samples densely. In regions where the function 
doesn't change much (correspondingly low information 
density), it samples sparsely. As a matter of fact, the 
density of the points seems to follow the derivative of 
the target function as shown in fig. 6. 

Consequently, we conjecture that 

Conjecture 1 The density of points sampled by the ac- 
tive learning algorithm is proportional to the derivative 
of the function at that point for differentiable functions. 

Remarks: 




Figure 6: The dotted line shows the density of the samples along the x-axis when the target was the monotone function of the previous example. The bold line is a plot of the derivative of the function. Notice the correlation between the two.



1. The CLA seems to sample functions according to 
its rate of change over the different regions. We 
have remarked earlier, that the best possible sam- 
pling strategy would be obtained by eq. 3 earlier. 
This corresponds to a teacher (who knows the tar- 
get function and the learner) selecting points for 
the learner. How does the CLA sampling strategy 
differ from the best possible one? Does the sam- 
pling strategy converge to the best possible one as 
the data goes to infinity? In other words, does the 
CLA discover the best strategy? These are inter- 
esting questions. We do not know the answer. 

2. We remarked earlier that another bound on the 
performance of the active strategy was that pro- 
vided by the classical optimal recovery formulation 
of eq. 2. This, as we shall show in the next section, 
is equivalent to uniform sampling. We remind the 
reader that a crucial difference between uniform 
sampling and CLA lies in the fact that CLA is an 
adaptive strategy and for some functions might ac- 
tually learn with very few examples. We will ex- 
plore this difference soon. 

3.3.2 Classical Optimal Recovery 

For an $L_1$ error criterion, classical optimal recovery as given by eq. 2 yields a uniform sampling strategy. To see this, imagine that we chose to sample the function at points $x_i$, $i = 1, \ldots, n$. Pick a possible target function $f$ and let $y_i = f(x_i)$ for each $i$. We then get the situation depicted in fig. 7. The $n$ points divide the domain into $n + 1$ intervals. Let these intervals have length $a_i$ each as shown. Further, if $[x_{i-1}, x_i]$ corresponds to the interval of length $a_i$, then let $y_i - y_{i-1} = b_i$. In other words we would get $n + 1$ rectangles with sides $a_i$ and $b_i$ as shown in the figure.

It is clear that choosing a vector $b = (b_1, \ldots, b_{n+1})'$ with the constraint that $\sum_{i=1}^{n+1} b_i = M$ and $b_i \ge 0$ is equivalent to defining a set of $y$ values (in other words, a data set) which can be generated by some function in the class $\mathcal{F}$. Specifically, the data values at the respective sample points would be given by $y_1 = b_1$, $y_2 = b_1 + b_2$ and so on. We can define $\mathcal{F}_b$ to be the set of monotonic functions in $\mathcal{F}$ which are consistent with these data points. In fact, every $f \in \mathcal{F}$ would map onto some $b$, and thus belong to some $\mathcal{F}_b$. Consequently,

$$\mathcal{F} = \bigcup_{\{b \,:\, b_i \ge 0,\ \sum b_i = M\}} \mathcal{F}_b$$

Figure 7: The situation when a function $f \in \mathcal{F}$ is picked, $n$ sample points (the $x$'s) are chosen and the corresponding $y$ values are obtained. Each choice of sample points corresponds to a choice of the $a$'s. Each choice of a function corresponds to a choice of the $b$'s.

Given a target function $f \in \mathcal{F}_b$, and a choice of $n$ points $x_i$, one can construct the data set $\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \ldots n}$ and the approximation scheme generates an approximating function $h(\mathcal{D})$. It should be clear that for an $L_1$ distance metric ($d(f, h) = \int_0^1 |f - h| \, dx$), the following is true:

$$\sup_{f \in \mathcal{F}_b} d(f, h) = \frac{1}{2} \sum_{i=1}^{n+1} a_i b_i = \frac{1}{2} a \cdot b$$

Thus, taking the supremum over the entire class of functions is equivalent to

$$\sup_{f \in \mathcal{F}} d(f, h(\mathcal{D})) = \sup_{\{b \,:\, b_i \ge 0,\ \sum b_i = M\}} \frac{1}{2} a \cdot b$$

The above is a straightforward linear programming problem and yields as its solution the result $\frac{1}{2} M \max\{a_i,\ i = 1, \ldots, (n + 1)\}$.

Finally, every choice of $n$ points $x_i$, $i = 1, \ldots, n$ results in a corresponding vector $a$ where $a_i \ge 0$ and $\sum a_i = 1$. Thus minimizing the maximum error over all the choices of sample points (according to eq. 2) is equivalent to

$$\arg\min_{x_1, \ldots, x_n} \sup_{f \in \mathcal{F}} d(f, h(\mathcal{D} = \{(x_i, f(x_i))\}_{i=1 \ldots n})) = \arg\min_{\{a \,:\, a_i \ge 0,\ \sum a_i = 1\}} \max\{a_i;\ i = 1 \ldots n + 1\}$$
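The reduction above says that, under the $L_1$ criterion, the worst-case error of a fixed set of sample points depends only on the longest gap between neighbouring points. The short Python sketch below illustrates this; the helper names are ours, and the formula $\frac{1}{2} M \max_i a_i$ is taken directly from the linear programming solution above.

```python
def worst_case_l1_error(sample_points, M):
    """(1/2) * M * (largest gap) for sample points placed in (0, 1)."""
    pts = [0.0] + sorted(sample_points) + [1.0]
    gaps = [b - a for a, b in zip(pts, pts[1:])]  # the interval lengths a_i
    return 0.5 * M * max(gaps)

def uniform_points(n):
    """n equally spaced interior points, giving n + 1 equal gaps."""
    return [i / (n + 1) for i in range(1, n + 1)]

M = 1.0
print(worst_case_l1_error(uniform_points(4), M))          # equal gaps of 1/5
print(worst_case_l1_error([0.1, 0.2, 0.3, 0.4], M))       # one long gap of 0.6
```

Since the gaps must sum to 1, the largest gap is smallest when all $n + 1$ gaps are equal, which is the uniform sampling strategy.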
Fn. No.   Avg. Random/CLA   Avg. Uniform/CLA
1         7.23              1.66
2         61.37             10.91
3         6.67              1.10
4         8.07              1.62
5         6.62              1.56

Table 1: Shown in this table is the average error rate of the random sampling and the uniform sampling strategies as a multiple of the error rates due to CLA. Thus for function 3, for example, uniform error rates are on average 1.1 times CLA error rates. The averages are taken over the different values of N (number of examples) for which the simulations have been done. Note that this is not a very meaningful average, as the difference in the error rates between the various strategies grows with N (as can be seen from the curves) if there is a difference in the order of the sample complexity. However, the averages have been provided just to give a feel for the numbers.



We also need to specify some other terms for this class of functions. The approximation scheme $\mathcal{H}$ is a first order spline as before, the domain $D = [0, 1]$ is partitioned into intervals by the data $[x_i, x_{i+1}]$ (again as before) and the metric $d_C$ is an $L_1$ metric given by $d_C(f_1, f_2) = \int_C |f_1(x) - f_2(x)| \, dx$. The results in this section can be extended to an $L_p$ norm but we confine ourselves to an $L_1$ metric for simplicity of presentation.

4.1 Lower Bounds 

Theorem 4 Any learning algorithm (whether passive or active) has to draw at least $\Omega(d/\epsilon)$ examples (whether randomly or by choosing) in order to PAC learn the class $\mathcal{F}$.

Proof Sketch: Let us assume that the learner collects $m$ examples (passively by drawing according to some distribution, or actively by any other means). Now we show that an adversary can force the learner to make an error of at least $\epsilon$ if it draws fewer than $\Omega(d/\epsilon)$ examples. This is how the adversary functions.

At each of the $m$ points which are collected by the learner, the adversary claims the function has value 0. Thus the learner is reduced to coming up with a hypothesis that belongs to $\mathcal{F}$ and which it claims will be within an $\epsilon$ of the target function. Now we need to show that whatever the function hypothesized by the learner, the adversary can always come up with some other function, also belonging to $\mathcal{F}$, and agreeing with all the data points, which is more than an $\epsilon$ distance away from the learner's hypothesis. In this way, the learner will be forced to make an error greater than $\epsilon$.

The $m$ points drawn by the learner divide the region $[0, 1]$ into (at most) $m + 1$ different intervals. Let the lengths of these intervals be $b_1, b_2, b_3, \ldots, b_{m+1}$. The "true" function, or in other words, the function the adversary will present, should have value 0 at the endpoints of each of the above intervals. We first state the following lemma.



Lemma 2 There exists a function $f \in \mathcal{F}$ such that $f$ interpolates the data and

$$\int_{[0,1]} |f| \, dx > \frac{kd}{4(m+1)}$$

where $k$ is a constant arbitrarily close to 1.

Proof: Consider fig. 11. The function $f$ is indicated by the dark line. As is shown, $f$ changes sign at each $x = x_i$. Without loss of generality, we consider an interval $[x_i, x_{i+1}]$ of length $b_i$. Let the midpoint of this interval be $z = (x_i + x_{i+1})/2$. The function here has the values

$$f(x) = \begin{cases} d(x - x_i) & \text{for } x \in [x_i, z - \alpha] \\ -d(x - x_{i+1}) & \text{for } x \in [z + \alpha, x_{i+1}] \\ -\dfrac{d(x - z)^2}{2\alpha} + \dfrac{d(b_i - \alpha)}{2} & \text{for } x \in [z - \alpha, z + \alpha] \end{cases}$$

Simple algebra shows that

$$\int_{x_i}^{x_{i+1}} |f| \, dx > d\left(\frac{b_i - \alpha}{2}\right)^2 + \alpha d\left(\frac{b_i - \alpha}{2}\right) = d(b_i^2 - \alpha^2)/4$$

Clearly, $\alpha$ can be chosen small, so that

$$\int_{x_i}^{x_{i+1}} |f| \, dx > \frac{k d b_i^2}{4}$$

where $k$ is as close to 1 as we want. By combining the different pieces of the function we see that

$$\int_0^1 |f| \, dx > \frac{kd}{4} \sum_{i=1}^{m+1} b_i^2$$

Figure 11: Construction of a function satisfying Lemma 2.

Now we make use of the following lemma,

Lemma 3 For a set of numbers $b_1, \ldots, b_m$ such that $b_1 + b_2 + \ldots + b_m = 1$, the following inequality is true

$$b_1^2 + b_2^2 + \ldots + b_m^2 \ge 1/m$$

Proof: By induction. □

Now it is easy to see how the adversary functions. Suppose the learner postulates that the true function is $h$. Let $\int_{[0,1]} |h| \, dx = \chi$. If $\chi > \epsilon$, the adversary claims that the true function was $f = 0$. In that case $\int_0^1 |h - f| \, dx = \chi > \epsilon$. If on the other hand, $\chi \le \epsilon$, then the adversary claims that the true function was $f$ (as above). In that case,

$$\int_0^1 |f - h| \, dx \ge \int_0^1 |f| \, dx - \int_0^1 |h| \, dx > \frac{kd}{4(m+1)} - \chi$$

Clearly, if $m$ is less than $\frac{kd}{8\epsilon} - 1$, the learner is forced again to make an error greater than $\epsilon$. Thus in either case, the learner is forced to make an error greater than or equal to $\epsilon$ if fewer than $\Omega(d/\epsilon)$ examples are collected (howsoever these examples are collected). □

The previous result holds for all learning algorithms. 
It is possible to show the following result for a passive 
learner. 

Theorem 5 A passive learner must draw at least $\max(\Omega(d/\epsilon), \sqrt{d/\epsilon}\,\ln(1/\delta))$ examples to learn this class.

Proof Sketch: The $d/\epsilon$ term in the lower bound follows directly from the previous theorem. We show how the second term is obtained.

Consider the uniform distribution on $[0, 1]$ and a subclass of functions which have value 0 on the region $A = [0, 1 - \alpha]$ and belong to $\mathcal{F}$. Suppose the passive learner draws $l$ examples uniformly at random. Then with probability $(1 - \alpha)^l$, all these examples will be drawn from region $A$. It only remains to show that for this event, and the subclass considered, whatever the function hypothesized by the learner, an adversary can force it to make a large error.

It is easy to show (using the arguments of the earlier theorem) that there exists a function $f \in \mathcal{F}$ such that $f$ is 0 on $A$ and $\int_{[1-\alpha,1]} |f| \, dx = \frac{1}{2}\alpha^2 d$. This is equal to $2\epsilon$ if $\alpha = \sqrt{4\epsilon/d}$. Now let the learner's hypothesis be $h$. Let $\int_{[1-\alpha,1]} |h| \, dx = \chi$. If $\chi$ is greater than $\epsilon$, the adversary claims the target was $g = 0$. Otherwise, the adversary claims the target was $g = f$. In either case, $\int |g - h| \, dx \ge \epsilon$.

It is possible to show (by an identical argument to the proof of theorem 1), that unless $l \ge \frac{1}{4}\sqrt{d/\epsilon}\,\ln(1/\delta)$, all examples will be drawn from $A$ with probability greater than $\delta$ and the learner will be forced to make an error of at least $\epsilon$. Thus the second term appears, indicating the dependence on $\delta$ in the lower bound. □

4.2 Active Learning Algorithms 

We now derive in this section an algorithm which actively selects new examples on the basis of information gathered from previous examples. This illustrates how our formulation of section 3.1.1 can be used in this case to effectively obtain an optimal adaptive sampling strategy.

4.2.1 Derivation of an optimal sampling strategy

Fig. 12 shows an arbitrary data set containing information about some unknown target function. Since the target is known to have a first derivative bounded by $d$, it is clear that the target is constrained to lie within the parallelograms shown in the figure. The slopes of the lines making up the parallelogram are $d$ and $-d$ appropriately. Thus, $\mathcal{F}_\mathcal{D}$ consists of all functions which lie within the parallelograms and interpolate the data set. We can now compute the uncertainty of the approximation scheme over any interval $C$ (given by $e_C(\mathcal{H}, \mathcal{D}, \mathcal{F})$) for this case. Recall that the approximation scheme $\mathcal{H}$ is a first order spline, and the data $\mathcal{D}$ consists of $(x, y)$ pairs. Fig. 13 shows the situation for a particular interval ($C_i = [x_i, x_{i+1}]$). Here $i$ ranges from 0 to $n$. As in the previous example, we let $x_0 = 0$ and $x_{n+1} = 1$.

The maximum error the approximation scheme $\mathcal{H}$ could have on this interval is given by half the area of the parallelogram:

$$e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_\mathcal{D}} \int_{C_i} |h - f| \, dx = \frac{d^2 B_i^2 - A_i^2}{4d}$$

where $A_i = |f(x_{i+1}) - f(x_i)|$ and $B_i = x_{i+1} - x_i$. Clearly, the maximum error the approximation scheme could have over the entire domain is given by

$$e_{D=[0,1]}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \sup_{f \in \mathcal{F}_\mathcal{D}} \sum_{j=0}^{n} \int_{C_j} |f - h| \, dx = \sum_{j=0}^{n} e_{C_j} \qquad (5)$$

The computation of $e_C$ is crucial to the derivation of the active sampling strategy. Now imagine that we chose to sample at a point $x$ in the interval $C_i$ and received a value $y$ (belonging to $\mathcal{F}_\mathcal{D}(x)$). This adds one more interval and divides $C_i$ into two intervals $C_{i1}$ and $C_{i2}$ as shown in fig. 14. We also obtain two correspondingly smaller parallelograms within which the target function is now constrained to lie.

The addition of this new data point to the data set ($\mathcal{D}' = \mathcal{D} \cup (x, y)$) requires us to recompute the learner's hypothesis (denoted by $h'$ in fig. 14). Correspondingly, it also requires us to update $e_C$, i.e., we now need to compute $e_C(\mathcal{H}, \mathcal{D}', \mathcal{F})$. First we observe that the addition of the new data point does not affect the uncertainty measure on any interval other than the divided interval $C_i$. This is clear when we notice that the parallelograms (whose area denotes the uncertainty on each interval) for all the other intervals are unaffected by the new data point.

Thus,

$$e_{C_j}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = e_{C_j}(\mathcal{H}, \mathcal{D}, \mathcal{F}) = \frac{1}{4d}(d^2 B_j^2 - A_j^2) \quad \text{for } j \ne i$$

For the $i$th interval $C_i$, the total uncertainty is now recomputed as (half the sum of the two parallelograms in fig. 14)

$$e_{C_i}(\mathcal{H}, \mathcal{D}', \mathcal{F}) = \frac{1}{4d}\left((d^2 u^2 - v^2) + (d^2 (B_i - u)^2 - (A_i - v)^2)\right)$$

$$= \frac{1}{4d}\left((d^2 u^2 + d^2 (B_i - u)^2) - (v^2 + (A_i - v)^2)\right) \qquad (6)$$



where $u = x - x_i$, $v = y - y_i$, and $A_i$ and $B_i$ are as before. Note that $u$ ranges from 0 to $B_i$, for $x_i \le x \le x_{i+1}$. However, given a particular choice of $x$ (this fixes a value of $u$), the possible values $v$ can take are constrained by the geometry of the parallelogram. In particular, $v$ can only lie within the parallelogram. For a particular $x$, we know that $\mathcal{F}_\mathcal{D}(x)$ represents the set of all possible $y$ values we can receive. Since $v = y - y_i$, it is clear that $v \in \mathcal{F}_\mathcal{D}(x) - y_i$. Naturally, if $y < y_i$, we find that $v < 0$, and $A_i - v > A_i$. Similarly, if $y > y_{i+1}$, we find that $v > A_i$.

We now prove the following lemma:

Lemma 4 The following two identities are valid for the appropriate mini-max problem.

Identity 1:

$$\frac{B}{2} = \arg\min_{u \in [0,B]} \sup_{v \in \{\mathcal{F}_\mathcal{D}(x) - y_i\}} H_1(u, v)$$

where $H_1(u, v) = \left((d^2 u^2 + d^2 (B - u)^2) - (v^2 + (A - v)^2)\right)$

Identity 2:

$$\frac{1}{2}(d^2 B^2 - A^2) = \min_{u \in [0,B]} \sup_{v \in \{\mathcal{F}_\mathcal{D}(x) - y_i\}} H_2(u, v)$$

where $H_2(u, v) = \left((d^2 u^2 + d^2 (B - u)^2) - (v^2 + (A - v)^2)\right)$

Proof: The expression on the right is a difference of two quadratic expressions and can be expressed as $q_1(u) - q_2(v)$. For a particular $u$, the expression is maximized when the quadratic $q_2(v) = (v^2 + (A - v)^2)$ is minimized. Observe that this quadratic is globally minimized at $v = A/2$. We need to perform this minimization over the set $v \in \mathcal{F}_\mathcal{D}(x) - y_i$ (this is the set of values which lie within the upper and lower boundaries of the parallelogram shown in fig. 15). There are three cases to consider.

Case I: $u \in [A/2d, B - A/2d]$

First, notice that for $u$ in this range, it is easy to verify that the upper boundary of the parallelogram is greater than $A/2$ while the lower boundary is less than $A/2$. Thus we can find a value of $v$ (viz. $v = A/2$) which globally minimizes this quadratic because $A/2 \in \mathcal{F}_\mathcal{D}(x) - y_i$. The expression thus reduces to $d^2 u^2 + d^2 (B - u)^2 - A^2/2$. Over the interval for $u$ considered in this case, it is minimized at $u = B/2$ resulting in the value

$$(d^2 B^2 - A^2)/2$$

Case II: $u \in [0, A/2d]$

In this case, the upper boundary of the parallelogram (which is the maximum value $v$ can take) is less than $A/2$ and hence $q_2(v)$ is minimized when $v = du$. The total expression then reduces to

$$d^2 u^2 + d^2 (B - u)^2 - ((du)^2 + (A - du)^2)$$

$$= d^2 (B - u)^2 - (A - du)^2 = (d^2 B^2 - A^2) - 2ud(dB - A)$$

Since $dB \ge A$, the above is minimized on this interval by choosing $u = A/2d$ resulting in the value

$$dB(dB - A)$$

Case III: By symmetry, this reduces to case II.

Since $(d^2 B^2 - A^2)/2 \le dB(dB - A)$ (this is easily seen by completing squares), it follows that $u = B/2$ is the global solution of the mini-max problem above. Further, we have shown that for this value of $u$, the sup term reduces to $(d^2 B^2 - A^2)/2$ and the lemma is proved. □
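Lemma 4 lends itself to a quick numerical check. The sketch below is an illustration under our own assumptions rather than part of the original argument: the helper names are ours, the feasible range for $v$ is written out from the slope constraints $\pm d$ through the two endpoints of the interval (with $A \ge 0$ and $dB \ge A$), the inner supremum is computed exactly by clamping $A/2$ to that range, and the outer minimum is taken over a finite grid in $u$.

```python
def sup_H(u, A, B, d):
    """sup over feasible v of (d^2 u^2 + d^2 (B-u)^2) - (v^2 + (A-v)^2)."""
    lo = max(-d * u, A - d * (B - u))   # lower boundary of the parallelogram
    hi = min(d * u, A + d * (B - u))    # upper boundary of the parallelogram
    v = min(max(A / 2.0, lo), hi)       # minimizer of v^2 + (A-v)^2 on [lo, hi]
    return d * d * u * u + d * d * (B - u) ** 2 - (v * v + (A - v) ** 2)

def minimax(A, B, d, n_u=2001):
    """Grid search for the u in [0, B] minimizing sup_H."""
    us = [B * k / (n_u - 1) for k in range(n_u)]
    u_best = min(us, key=lambda u: sup_H(u, A, B, d))
    return u_best, sup_H(u_best, A, B, d)

A, B, d = 0.2, 0.5, 1.0
u_star, value = minimax(A, B, d)
print(u_star, value, (d * d * B * B - A * A) / 2.0)
```

For these values the grid minimizer is the midpoint $u = B/2$ and the mini-max value agrees with $(d^2 B^2 - A^2)/2$, as the two identities state.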
Using the above lemma along with eq. 6, we see that

$$\min_{x \in C_i} \sup_{y \in \mathcal{F}_\mathcal{D}(x)} e_{C_i}(\mathcal{H}, \mathcal{D} \cup (x, y), \mathcal{F}) = \frac{1}{8d}(d^2 B_i^2 - A_i^2)$$

In other words, by sampling the midpoint of the interval $C_i$, we are guaranteed to reduce the uncertainty by a factor of 1/2. As in the case of monotonic functions, we see that using eq. 5, we should sample the midpoint of the interval with largest uncertainty $e_{C_i}(\mathcal{H}, \mathcal{D}, \mathcal{F})$ to obtain the global solution in accordance with the principle of Algorithm B of section 2.

This allows us to formally state an active learning 
algorithm which is optimal in the sense implied in our 
formulation. 

The Choose and Learn Algorithm - 2 (CLA-2)

1. [Initial Step] Ask for values of the function
at points $x = 0$ and $x = 1$. At this stage,
the domain $D = [0, 1]$ is composed of one interval
only, viz., $C_1 = [0, 1]$. Compute
$e_{C_1} = \frac{1}{4d}\left(d^2 - |f(1) - f(0)|^2\right)$
and $e_D = e_{C_1}$. If $e_D < \epsilon$,
stop and output the linear interpolant of the samples
as the hypothesis; otherwise query the midpoint
of the interval to get a partition of the domain
into two subintervals $[0, 1/2)$ and $[1/2, 1]$.

2. [General Update and Stopping Rule] In general,
at the $k$th stage, suppose that our partition
of the interval $[0, 1]$ is $[x_0 = 0, x_1), [x_1, x_2), \ldots,
[x_{k-1}, x_k = 1]$. We compute the uncertainty
$e_{C_i} = \frac{1}{4d}\left(d^2(x_i - x_{i-1})^2 - |y_i - y_{i-1}|^2\right)$ for each
$i = 1, \ldots, k$. The midpoint of the interval with
maximum $e_{C_i}$ is queried for the next sample. The
total error $e_D = \sum_{i=1}^{k} e_{C_i}$ is computed at each
stage and the process is terminated when $e_D < \epsilon$.
Our hypothesis $h$ at every stage is a linear interpolation
of all the points sampled so far and our final
hypothesis is obtained upon the termination of the
whole process.
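
The two steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than the code used for the experiments: the target `f` is treated as an oracle on $[0, 1]$ whose derivative is bounded by `d`, ties between equally uncertain intervals are broken by taking the leftmost one, and `max_queries` is a safety cap that does not appear in the algorithm itself.

```python
def interval_uncertainty(x0, y0, x1, y1, d):
    """Uncertainty e_C = (d^2 (x1 - x0)^2 - (y1 - y0)^2) / (4 d) of one interval."""
    return (d**2 * (x1 - x0)**2 - (y1 - y0)**2) / (4.0 * d)

def cla2(f, d, eps, max_queries=100000):
    """Query f on [0, 1] until the total uncertainty e_D drops below eps.

    Returns the sorted sample locations and values; the hypothesis is the
    linear interpolant of these points.
    """
    xs = [0.0, 1.0]                      # initial step: query both endpoints
    ys = [f(0.0), f(1.0)]
    while len(xs) < max_queries:
        errs = [interval_uncertainty(xs[i], ys[i], xs[i + 1], ys[i + 1], d)
                for i in range(len(xs) - 1)]
        if sum(errs) < eps:              # stopping rule: e_D < eps
            break
        # Query the midpoint of the interval with maximum uncertainty.
        i = max(range(len(errs)), key=errs.__getitem__)
        xm = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, xm)
        ys.insert(i + 1, f(xm))
    return xs, ys

if __name__ == "__main__":
    import math
    # Hypothetical target with |f'| <= 1, so d = 1 is a valid bound.
    xs, ys = cla2(lambda x: math.sin(3.0 * x) / 3.0, d=1.0, eps=0.01)
    print(len(xs), "examples queried")
```

Recomputing every interval's uncertainty at each stage keeps the sketch short; a priority queue keyed on $e_{C_i}$ would avoid the repeated work for small $\epsilon$.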

It is possible to show that the following upper bound
exists on the number of examples CLA-2 would take to
learn the class of functions under consideration.

Theorem 6 The CLA-2 would PAC learn the class in
at most $\frac{d}{4\epsilon} + 1$ examples.

Proof Sketch: Following a strategy similar to the proof
of Theorem 3, we show how a slight variant of CLA-2
would converge in at most $d/(4\epsilon) + 1$ examples. Imagine
a grid of $n$ points placed $1/(n - 1)$ apart on the domain
$D = [0, 1]$ where the $k$th point is $k/(n - 1)$ (for $k$ going
from 0 to $n - 1$). The variant of the CLA-2 operates
by confining its queries to points on this grid. Thus
at the $k$th stage, instead of querying the midpoint of
the interval with maximum uncertainty, it will query the
gridpoint closest to this midpoint. Suppose it uses up all
the gridpoints in this fashion; then there will be $n - 1$
intervals and, by our arguments above, we have seen that
the maximum error on each interval is bounded by

$$\frac{1}{4d}\left(d^2\left(\frac{1}{n-1}\right)^2 - |y_i - y_{i-1}|^2\right) \le \frac{1}{4d}\, d^2 \left(\frac{1}{n-1}\right)^2$$

Since there are $n - 1$ such intervals, the total error it
could make is bounded by

$$(n-1)\,\frac{1}{4d}\, d^2 \left(\frac{1}{n-1}\right)^2 = \frac{d}{4}\left(\frac{1}{n-1}\right)$$

It is easy to show that for $n > d/(4\epsilon) + 1$, this maximum
error is less than $\epsilon$. Thus the learner need not collect any
more than $d/(4\epsilon) + 1$ examples to learn the target function
to within an $\epsilon$ accuracy. Note that the learner will have
identified the target to $\epsilon$ accuracy with probability 1
(always) by following the strategy outlined in this variant
of CLA-2. □
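
The arithmetic in the proof sketch is easy to reproduce. The snippet below is a small illustration, with hypothetical function names, of the total-error bound $d/(4(n-1))$ for a grid of $n$ points and of the smallest grid size for which that bound falls strictly below $\epsilon$.

```python
def grid_total_error_bound(n, d):
    """Bound d / (4 (n - 1)) on the total error when all n gridpoints are used."""
    return d / (4.0 * (n - 1))

def grid_size_for_accuracy(d, eps):
    """Smallest n >= 2 whose total-error bound is strictly less than eps."""
    n = 2
    while grid_total_error_bound(n, d) >= eps:
        n += 1
    return n

if __name__ == "__main__":
    d, eps = 2.0, 0.125
    n = grid_size_for_accuracy(d, eps)
    # d / (4 eps) + 1 = 5 here, so the first n strictly above it is 6.
    print(n, grid_total_error_bound(n, d))
```

The linear search is deliberate: it sidesteps floating-point edge cases that a closed-form `floor` expression would hit when $d/(4\epsilon)$ is an integer.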

We now have both an upper and lower bound for PAC-learning
the class (under a uniform distribution) with
queries. Notice that here as well, the sample complexity
of active learning does not depend upon the confidence
parameter $\delta$. Thus for $\delta$ arbitrarily small, the difference
in sample complexities between passive and active learning
becomes arbitrarily large, with active learning requiring
far fewer examples.

4.3 Some Simulations 

We now provide some simulations conducted on arbitrary
functions from the class of functions with bounded
derivative (the class $\mathcal{F}$). Fig. 16 shows four arbitrarily
selected functions which were chosen to be the target
functions for the approximation scheme considered. In
particular, we are interested in observing how the active
strategy samples the target function in each case. Further,
we are interested in comparing the active and passive
techniques with respect to error rates for the same
number of examples drawn. In this case, we have been
unable to derive an analytical solution to the classical
optimal recovery problem. Hence, we do not compare it
as an alternative sampling strategy in our simulations.

4.3.1 Distribution of points selected 

The active algorithm CLA-2 selects points adaptively 
on the basis of previous examples received. Thus the 
distribution of the sample points in the domain D of 
the function depends inherently upon the arbitrary tar- 
get function. Consider for example, the distribution of 
points when the target function is chosen to be Function- 
1 of the set shown in fig. 16. 

Notice (as shown in fig. 17) that the algorithm chooses
to sample densely in places where the target is flat, and
less densely where the function has a steep slope. As our
mathematical analysis of the earlier section showed, this
is well founded. Roughly speaking, if the function has
the same value at $x_i$ and $x_{i+1}$, then it could have a
variety of values (wiggle a lot) within. However, if $f(x_{i+1})$
is much greater (or less) than $f(x_i)$, then, in view of the
bound, $d$, on how fast it can change, it would have had to
increase (or decrease) steadily over the interval. In the
second case, the rate of change of the function over the
interval is high, there is less uncertainty in the values of
the function within the interval, and consequently fewer
samples are needed in between.
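
A small numerical illustration of this argument, with made-up values and hypothetical function names, compares a flat interval with a steep one of the same length. The midpoint width $dB - |A|$ used below is not stated in the text; it follows from intersecting the two slope constraints at the centre of the interval.

```python
def interval_uncertainty(B, A, d):
    """Uncertainty (d^2 B^2 - A^2) / (4 d) for an interval of length B
    whose endpoint values differ by A."""
    return (d**2 * B**2 - A**2) / (4.0 * d)

def midpoint_envelope_width(B, A, d):
    """Range of values a function with |f'| <= d can take at the midpoint."""
    return d * B - abs(A)

if __name__ == "__main__":
    d, B = 1.0, 0.1
    for label, A in (("flat", 0.0), ("steep", 0.09)):
        print(label, interval_uncertainty(B, A, d), midpoint_envelope_width(B, A, d))
```

With these values the flat interval has uncertainty 0.0025 and the steep one about 0.000475, so CLA-2 would query the flat interval first.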
In example 1, for the case of monotone functions, we
saw that the density of sample points was proportional
to the first derivative of the target function. By contrast,
in this example, the optimal strategy chooses to sample
points with a density that is inversely proportional to the
magnitude of the first derivative of the target function.
Fig. 18 exemplifies this.

4.3.2 Error Rates: 

In an attempt to relate the number of examples drawn 
and the error made by the learner, we performed the 
following simulation. 
Simulation B: 

1. Pick an arbitrary function from class $\mathcal{F}$.

2. Decide N, the number of samples to be collected. 
There are two methods of collection of samples. 
The first (passive) is by randomly drawing N ex- 
amples according to a uniform distribution on [0, 1]. 
The second (active) is the CLA-2. 

3. The two learning algorithms differ only in their 
method of obtaining samples. Once the samples 
are obtained, both algorithms attempt to approx- 
imate the target by the linear interpolant of the 
samples (first order splines). 

4. This entire process is now repeated for various values
of N for the same target function and then repeated
again for the four different target functions
of fig. 16.
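
Simulation B can be sketched as follows. This is a hedged reconstruction, not the original experimental code: the target function is a hypothetical stand-in for those of fig. 16, the error is approximated by the mean absolute difference on a dense grid (an assumed proxy for the distance used in the paper), and CLA-2 is run with a fixed budget of N examples instead of the $\epsilon$ stopping rule.

```python
import numpy as np

def cla2_budget(f, d, n):
    """Active sampling: run CLA-2 until exactly n examples have been queried."""
    xs, ys = [0.0, 1.0], [f(0.0), f(1.0)]
    while len(xs) < n:
        x, y = np.array(xs), np.array(ys)
        errs = (d**2 * np.diff(x)**2 - np.diff(y)**2) / (4.0 * d)
        i = int(np.argmax(errs))          # interval with maximum uncertainty
        xm = 0.5 * (xs[i] + xs[i + 1])    # query its midpoint
        xs.insert(i + 1, xm)
        ys.insert(i + 1, f(xm))
    return np.array(xs), np.array(ys)

def passive_sample(f, n, rng):
    """Passive sampling: n examples drawn uniformly at random from [0, 1]."""
    xs = np.sort(rng.uniform(0.0, 1.0, n))
    return xs, f(xs)

def l1_error(f, xs, ys, grid):
    """Mean absolute error of the linear interpolant on a dense grid."""
    return float(np.mean(np.abs(f(grid) - np.interp(grid, xs, ys))))

if __name__ == "__main__":
    f = lambda x: np.sin(3.0 * x) / 3.0   # hypothetical target, |f'| <= 1
    d = 1.0
    grid = np.linspace(0.0, 1.0, 2001)
    rng = np.random.default_rng(0)
    for n in (5, 9, 17, 33, 65):
        active = l1_error(f, *cla2_budget(f, d, n), grid)
        passive = l1_error(f, *passive_sample(f, n, rng), grid)
        print(n, active, passive)
```

Outside the range of the passive samples `np.interp` holds the boundary value constant, which is one of several reasonable conventions; averaging the passive error over many random draws would give a fairer comparison than the single draw used here.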

The results are shown in fig. 19. Notice how the active
learner outperforms the passive learner. For the same
number of examples, the active scheme, having chosen its
examples optimally by our algorithm, makes less error.

We have obtained in theorem 6 an upper bound on
the performance of the active learner. However, as we
have already remarked earlier, the number of examples
the active algorithm takes before stopping (i.e., outputting
an $\epsilon$-good approximation) varies and depends
upon the nature of the target function. "Simple" functions
are learned quickly, "difficult" functions are learned
slowly. As a point of interest, we have shown in fig. 20
how the actual number of examples drawn varies with $\epsilon$.
In order to learn a target function to $\epsilon$-accuracy, CLA-2
needs at most $n_{\max}(\epsilon) = d/(4\epsilon) + 1$ examples. However,
for a particular target function, $f$, let the number of examples
it actually requires be $n_f(\epsilon)$. We plot
$n_f(\epsilon)/n_{\max}(\epsilon)$ as a
function of $\epsilon$. Notice, first, that this ratio is always much
less than 1. In other words, the active learner stops
before the worst case upper bound with a guaranteed
$\epsilon$-good hypothesis. This is the significant advantage of
an adaptive sampling scheme. Recall that for uniform
sampling (or even classical optimal recovery) we would
have no choice but to ask for $d/(4\epsilon)$ examples to be sure of
having an $\epsilon$-good hypothesis. Further, notice that
as $\epsilon$ gets smaller, the ratio gets smaller. This suggests
that for these functions, the sample complexity of the
active learner is of a different order (smaller) than the
worst case bound. Of course, there always exists some
function in $\mathcal{F}$ which would force the active learner to
perform at its worst case sample complexity level.
crucial difference. The distance metric, d, is not necessarily
related to the distribution according to which data
is drawn (in the passive case). This prevents us from using
traditional uniform convergence (Vapnik, 1982) type
arguments to prove learnability. The problem of learning
under a different metric is an interesting one and merits
further investigation in its own right.
Figure 12: An arbitrary data set for the case of functions
with a bounded derivative. The functions in $\mathcal{F}_{\mathcal{D}}$ are
constrained to lie in the parallelograms as shown. The
slopes of the lines making up the parallelogram are $d$ and
$-d$ appropriately.








Figure 15: A figure to help the visualization of Lemma 4.
For the $x$ shown, the set $\mathcal{F}_{\mathcal{D}}(x)$ is the set of all values which
lie within the parallelogram corresponding to this $x$, i.e.,
on the vertical line drawn at $x$ but within the parallelogram.



Figure 13: A zoomed version of the ith interval. 
Figure 17: How CLA-2 chooses to sample its points.
Vertical lines have been drawn at the x values where the
CLA-2 queried the oracle for the corresponding function
value.



Figure 14: Subdivision of the ith interval when a new 
data point is obtained. 